This document gathers and illustrates work on a set of tools for analyzing Twitter communities and their interaction, built with Hadoop and R. The report is auto-generated by scripts that first aggregate data in Hadoop and then invoke R tools, which compile the markdown for this document while incorporating statistics, figures and tables generated from the data. The scripts are here, and the markdown source for this report here.

1 Pipeline overview

Data is gathered by using Twitter’s public streaming API endpoint to extract tweets related to Spanish political parties. A Flume agent with a Twitter4j source funnels the data received from Twitter’s API onto disk. The custom Flume source can be configured to follow a number of users, to track a number of phrases, and to filter tweets by language. An example configuration of the agent can be found here. Note that Twitter account ids (instead of names) have to be used to follow users.

The current strategy collects all tweets that originate from the official accounts of the major political parties and of their leaders and spokespersons, as well as all tweets that mention or retweet any of these accounts. In addition, a number of keywords related to the elections are tracked. The result is filtered to include only tweets in Spanish.

The resulting tweets are stored on a distributed file system (HDFS) as raw JSON files. They are arranged in daily folders, with individual files containing a roughly equal number of tweets. Hive provides an SQL-like view on these JSON files, with the Hive tables also partitioned by day. Various HQL scripts export daily edge lists for mentions and retweets, along with other aggregated summaries, into local text files.

A number of different graphs and tables can then be generated from the local data and inspected using the R tools. Graphs are distinguished by layer (retweet, mention, or both combined) and version (e.g. one for the period of the Catalan elections and one for the general elections). Graphs are also cached locally as R data files, so they don’t have to be re-created for each analysis (unless new data needs to be incorporated).
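To make the edge-list export concrete, here is a minimal Python sketch of what such an export computes. The report itself does this in HQL, so this is only an illustration: the function name `extract_edges` is hypothetical, and the field names (`user.screen_name`, `retweeted_status`, `entities.user_mentions`) are assumed to follow Twitter's classic v1.1 tweet JSON.

```python
import json
from collections import Counter

def extract_edges(lines):
    """Count (source, target) pairs for retweets and mentions.

    Assumes one JSON-encoded tweet per line, in Twitter's v1.1 format.
    """
    retweets, mentions = Counter(), Counter()
    for line in lines:
        tweet = json.loads(line)
        source = tweet["user"]["screen_name"]
        original = tweet.get("retweeted_status")
        if original is not None:
            retweets[(source, original["user"]["screen_name"])] += 1
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            mentions[(source, mention["screen_name"])] += 1
    return retweets, mentions

# Two made-up tweets: a retweet of "b" (which also mentions "b"),
# and a plain tweet mentioning "c".
sample = [
    json.dumps({"user": {"screen_name": "a"},
                "retweeted_status": {"user": {"screen_name": "b"}},
                "entities": {"user_mentions": [{"screen_name": "b"}]}}),
    json.dumps({"user": {"screen_name": "a"},
                "entities": {"user_mentions": [{"screen_name": "c"}]}}),
]
retweet_edges, mention_edges = extract_edges(sample)
```

Each counter maps a directed pair of accounts to the number of tweets between them, which is the edge weight used in the graph layers.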

2 Network level analysis

Each graph layer consists of vertices representing Twitter accounts. Edges between those vertices capture the number of times a user A has retweeted another user B (in the retweet layer R), or how many times A has ‘mentioned’ B (in the mention layer M). First we will look at some overall statistics for the different graph layers.

In the following sections we’ll use tweets from the Catalan election period as an example.

2.1 Tweet frequency and volume

We start off by looking at the volume and frequency of tweets, using an HQL script that aggregates the number of tweets per hour. In total, the catalunya version of the graph contains 3.228.346 tweets for the period between 04-08-2015 and 07-10-2015 (65 days).
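The hourly aggregation can be sketched in Python as follows. This is not the report's HQL script: the function name `tweets_per_hour` is hypothetical, and the timestamp format is assumed to be that of the `created_at` field in Twitter's v1.1 JSON. The last lines also check the stated length of the period, assuming the dates are written as DD-MM-YYYY and both endpoints are counted.

```python
from collections import Counter
from datetime import date, datetime, timezone

def tweets_per_hour(timestamps):
    """Count tweets per UTC hour from `created_at` strings."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%a %b %d %H:%M:%S %z %Y")
        # Truncate to the start of the hour in UTC.
        hour = t.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return counts

# Made-up timestamps for illustration.
hourly = tweets_per_hour([
    "Sun Sep 27 20:15:03 +0000 2015",
    "Sun Sep 27 20:59:59 +0000 2015",
    "Sun Sep 27 21:00:00 +0000 2015",
])

# Length of the collection period, counting both the first and last day.
period_days = (date(2015, 10, 7) - date(2015, 8, 4)).days + 1
```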

For a more detailed view we can look at tweet frequencies over time, at different levels of resolution:

On the left, tweet counts are shown aggregated by day for the complete period. On the right, an hourly count is shown for a single week.

2.2 Descriptive statistics

The following tables provide general statistics of the three graph layers before and after basic preprocessing. The preprocessing consists of:

  • simplification of edges, i.e. collapsing parallel edges by summing their weights (counts of retweets or mentions), although in theory there shouldn’t be any parallel edges to begin with
  • filtering out edges with weights below some threshold (here 1 is selected, so no filtering)
  • filtering out vertices not belonging to the largest (weakly) connected component (this removes parts of the network that are disconnected from the main component).

Lastly, stray nodes resulting from the filtering (those not connected to any other) are removed too.
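The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration, not the report's R code; the function name `preprocess` and the edge representation as `(source, target, weight)` triples are assumptions.

```python
from collections import defaultdict

def preprocess(edges, min_weight=1):
    """edges: iterable of (source, target, weight) triples."""
    # 1. Collapse parallel edges by summing their weights.
    weights = defaultdict(int)
    for u, v, w in edges:
        weights[(u, v)] += w
    # 2. Drop edges lighter than the threshold.
    kept = {e: w for e, w in weights.items() if w >= min_weight}
    # 3. Find the largest weakly connected component (direction ignored).
    neighbours = defaultdict(set)
    for u, v in kept:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, largest = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(largest):
            largest = component
    # Both endpoints of an edge share a component, so one check suffices.
    return {e: w for e, w in kept.items() if e[0] in largest}

raw = [("a", "b", 1), ("a", "b", 2), ("b", "c", 1), ("x", "y", 5)]
clean = preprocess(raw)
```

Because the result is an edge map, accounts left without any edge after filtering drop out automatically, which corresponds to removing the stray nodes.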

Note that the total number of tweets from the previous section (3.228.346) is not identical to the sum of the retweet and mention layers' edge or tweet counts. This is because retweets and mentions are not mutually exclusive. A retweet can also mention the retweeted account as well as other accounts, and a tweet may mention one or more accounts without being a retweet. As a result, the number of retweets alone may be smaller than the total number of tweets, as there will likely be mentions that are not retweets. Conversely, the number of mentions may exceed the total number of tweets (e.g. if the average number of mentions per tweet is greater than 1).

Also note that for the combined layer, the weights of edges (#tweets) between identical pairs of accounts in the retweet and mention layers are simply added, so the corresponding “tweet” numbers do add up. In contrast, the edges of the combined layer are the union of the edges in both layers, so the numbers of edges do not necessarily add up. One layer may in fact be a subset of the other. In most cases it seems that if one user has retweeted another, they have also mentioned that user. The converse is not generally true: many users seem to mention others without ever retweeting them. As a result, the set of connections formed by mentions may already contain all retweet connections, but not the other way round.
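A toy example may help to see why tweet counts add up in the combined layer while edge counts do not. The accounts and weights below are made up for illustration.

```python
from collections import Counter

# Edge weights keyed by (source, target); values are tweet counts.
retweet_layer = Counter({("a", "b"): 3})
mention_layer = Counter({("a", "b"): 4, ("a", "c"): 2})

# Counter addition sums the weights of identical (source, target) pairs
# and keeps pairs that occur in only one layer.
combined = retweet_layer + mention_layer

tweets_add_up = sum(combined.values()) == sum(retweet_layer.values()) + sum(mention_layer.values())
edges_add_up = len(combined) == len(retweet_layer) + len(mention_layer)
retweets_subset_of_mentions = set(retweet_layer) <= set(mention_layer)
```

Here the combined layer has 9 tweets (3 + 6) but only 2 edges rather than 3, because the pair (a, b) occurs in both layers.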

Next we can plot the edge weight and node degree distributions of the (preprocessed) retweet layer:

Figure: Log-log plots of network weight and degree distribution.

Both distributions appear to follow a power law (as is common in scale-free networks, for example).

We can also compute various node-centrality (importance) measures. For example, here are the 5 most central Twitter accounts with respect to in-degree (number of retweeters) and PageRank for the retweet layer:

Note that the in-degree here corresponds to the number of users having retweeted a particular account, not the number of retweets received (which requires taking into account the weight of each connection).
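The distinction between in-degree and retweets received can be sketched as follows. The function name `in_degree_and_received` and the sample accounts are hypothetical.

```python
from collections import Counter

def in_degree_and_received(edges):
    """edges: {(retweeter, retweeted): number of retweets}."""
    in_degree, received = Counter(), Counter()
    for (_, target), weight in edges.items():
        in_degree[target] += 1        # one more distinct retweeter
        received[target] += weight    # total retweets, using edge weights
    return in_degree, received

# "x" has a single, very active retweeter; "y" has two occasional ones.
edges = {("a", "x"): 10, ("b", "y"): 1, ("c", "y"): 1}
in_degree, received = in_degree_and_received(edges)
```

In this toy graph "y" ranks first by in-degree (2 retweeters against 1), while "x" ranks first by retweets received (10 against 2).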

2.3 Graph

Plotting a graph with hundreds of thousands of nodes and millions of edges usually results in an unintelligible “hairball”. We may, however, filter the graph down to a more manageable size, for example by looking at a daily snapshot.

The following figure shows the graph for the day of 27-09-2015. Even so, the visual representation is not particularly informative. The ten accounts with the highest in-degree are labeled with their names. The unequal distribution of retweets is obvious from the fact that there are few nodes with high degree (encoded in the figure by vertex size) and many with significantly smaller degree.

3 Community level

The goal of the analysis tools is to understand the interaction between different network communities. To this end, communities are identified in each graph layer based on structural network properties. For graphs as large as those explored here, few community detection algorithms are sufficiently fast. In the following the Louvain method is used (see here e.g.), but others can be substituted.

Eventually, we want to be able to follow community dynamics on a daily basis. But if we detect communities based only on the graph of tweets for a particular day, the composition of communities may vary a lot from day to day. This is because an individual’s retweet and mention activity may mostly involve one particular group of fellow tweeters one day, but a different group another day. However, if we take into account all tweets for the whole period considered, this should give us a more stable picture of which community an individual user generally belongs to. Hence community assignment is based on the graph of all tweets. When later analyzing activity on a per-day basis, instead of re-calculating communities for the smaller day-only network, we simply assign each node of that day to the community it belongs to in the “global” graph.
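Reusing the global assignment for a daily snapshot amounts to a simple lookup, as in this minimal sketch. The function name `daily_community_edges` and the data are hypothetical; accounts missing from the global assignment are assumed to have been filtered out earlier and are skipped.

```python
def daily_community_edges(daily_edges, global_membership):
    """Attach global community labels to one day's edges.

    daily_edges: {(source, target): weight} for a single day.
    global_membership: {account: community} from the full-period graph.
    """
    labelled = {}
    for (u, v), w in daily_edges.items():
        # Accounts filtered out of the global graph have no community.
        if u in global_membership and v in global_membership:
            labelled[(u, v)] = (global_membership[u], global_membership[v], w)
    return labelled

global_membership = {"a": "c1", "b": "c1", "c": "c2"}
day = {("a", "b"): 2, ("c", "a"): 1, ("d", "a"): 4}
labelled_day = daily_community_edges(day, global_membership)
```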

As a first step we can identify what level of modularity the community partitioning has achieved. In the case of the Louvain method applied to the retweet layer this is 0.693.
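For reference, modularity compares the share of edge weight inside communities with the share expected if edges were placed at random while preserving node degrees. The sketch below computes it for an undirected weighted graph; whether the report's directed layers are treated as undirected for this purpose is an assumption, and the function name `modularity` is hypothetical.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman modularity of an undirected weighted graph.

    edges: {(u, v): weight}, each undirected edge listed once.
    membership: {node: community id}.
    """
    total = sum(edges.values())
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for (u, v), w in edges.items():
        degree[membership[u]] += w
        degree[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)

# Two triangles joined by a single bridge edge.
toy_edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
             ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
             ("c", "d"): 1}
toy_membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(toy_edges, toy_membership)
```

For the toy graph each community holds 3 of the 7 edges and half of the total degree, giving Q = 2 · (3/7 − 1/4) = 5/14 ≈ 0.357.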

Next we check the number and size of communities identified. There are in total 2.080 communities. Their sizes are distributed as follows (only the 25 biggest are shown and ordered by size):

On a log scale this would likely be close to linear (indicating an exponential distribution of community sizes), as there are a few very large communities and a great number of small ones.

In the presented form the communities are of little interest. What we’re really interested in is the ideological identity of these communities. We identify it here based on the presence of certain individual accounts in each community. That is, given a map that assigns certain groups of accounts to their ideological affiliation, we can represent communities by that affiliation rather than by the abstract index in the previous figure.

To this end we have compiled by hand a list of important Twitter accounts associated with each political party in Spain. For example, for the party Podemos, the accounts of the party itself (@ahorapodemos), its election candidate (@Pablo_Iglesias_), and its spokesperson (@ierrejon) are assigned the affiliation “podemos”, and similarly for all other parties. Given these lists of party-related accounts, we can check whether each member of a list is part of a particular Twitter community. If that is the case for all members of a list, the community is equated with that party.

In theory this doesn’t guarantee that every community can be uniquely mapped to a party, since communities are identified solely based on structural network properties. In practice, however, we found that to be the case. So in the following figure we display the size of communities again, but now with more meaningful identifiers based on political parties:
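The labeling rule can be sketched as follows. The function name `label_communities`, the community indices, and the “pp” account list are made up for illustration; only the Podemos accounts are taken from the text.

```python
def label_communities(membership, party_accounts):
    """Map community ids to party names.

    membership: {account: community id} from community detection.
    party_accounts: {party: [accounts compiled by hand]}.
    """
    labels = {}
    for party, accounts in party_accounts.items():
        communities = {membership.get(account) for account in accounts}
        # All listed accounts must be present and share one community.
        if len(communities) == 1 and None not in communities:
            labels[communities.pop()] = party
    return labels

# Hypothetical community indices; the "pp" accounts are split on purpose.
membership = {"ahorapodemos": 3, "Pablo_Iglesias_": 3, "ierrejon": 3,
              "marianorajoy": 7, "PPopular": 8}
party_accounts = {"podemos": ["ahorapodemos", "Pablo_Iglesias_", "ierrejon"],
                  "pp": ["marianorajoy", "PPopular"]}
labels = label_communities(membership, party_accounts)
```

In this toy case community 3 is labeled “podemos”, while the made-up “pp” list is split across two communities and therefore labels neither.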

Here, only those communities for which we have supplied the manual mapping of party affiliation have been explicitly identified. The remaining communities are subsumed under the label “unknown”. Also, for efficiency reasons the mapping of community to party is only done for the n largest communities, usually 10, as there can be hundreds to thousands of smaller communities. Accounts not belonging to the 10 biggest communities are filtered out for the rest of the analysis (there are 12.770 such accounts in total). We will also omit the group of “unknown” communities from further analysis, and remove accounts that end up isolated as a result of filtering out smaller and unknown communities.

3.1 Important community members

With community identities in place, we can determine the important members in each. E.g. the following tables show the 5 members with the highest in-degree (#retweeters) for the 6 biggest communities:

podemos
ahorapodemos  Pablo_Iglesias_  ierrejon  Juanmi_News  gerardotc
23.061        21.289           11.526    3.802        3.400

pp
marianorajoy  PPopular  ANDRES_CANO42  jaimedeolano  pablocasado_
8.807         8.587     2.206          1.655         1.639

catalunya
XSalaimartin  Candeliano  HiginiaRoig  SergiCastanye  Mas_Enfurecido
2.893         2.653       2.408        1.751          1.704

iu
agarzon  iunida  cayo_lara  AntonioMaestre  MarinaAlbiol
5.464    4.811   2.186      778             751

ciudadanos
Albert_Rivera  CiudadanosCs  InesArrimadas  Schuma78  Cs_Madrid
12.305         10.427        2.886          2.681     1.959

psoe
PSOE   sanchezcastejon  gpscongreso  europapress  PSpresidente
7.765  6.217            898          757          690

3.2 Community coherence and interaction

Apart from the modularity measure given above, we can check how coherent the different communities are by measuring for each community the proportion of edges between members of the same community, and the proportion of edges members share with another community. If a community mostly retweets or mentions other accounts within the same community, then the corresponding value in the diagonal in the following interaction matrix should be large (close to 1), and the off-diagonal values low (close to 0):

Another way to look at the above interactions is to plot a graph in which all nodes belonging to the same community are collapsed into one, with edges between communities being the result of adding up all individual edges between them:

Here vertices are colored by their “official” party color, while their size corresponds to their degree and edge width is proportional to the number of tweets between communities.
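The interaction matrix and the collapsed community graph described in this section both start from the same aggregation of edge weights by community pair. The sketch below is one possible reading: it assumes that proportions are taken over tweet counts (edge weights) and normalized per source community, which the text does not state explicitly. The function name and data are hypothetical.

```python
from collections import defaultdict

def interaction_matrix(edges, membership):
    """Share of each community's outgoing tweets that go to each community.

    edges: {(source, target): weight}; membership: {account: community}.
    Returns {(source community, target community): proportion}.
    """
    between = defaultdict(float)  # summed weight per community pair
    totals = defaultdict(float)   # summed outgoing weight per community
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        between[(cu, cv)] += w
        totals[cu] += w
    return {pair: w / totals[pair[0]] for pair, w in between.items()}

toy_edges = {("a", "b"): 8, ("a", "c"): 2, ("c", "d"): 5}
toy_membership = {"a": "x", "b": "x", "c": "y", "d": "y"}
matrix = interaction_matrix(toy_edges, toy_membership)
```

In the toy data, community “x” directs 80% of its tweets to itself and 20% to “y”, while “y” only interacts with itself. The unnormalized `between` totals correspond to the edge weights of the collapsed graph.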

3.3 Comparison of communities

We can compare communities by calculating standard structural graph measures for their subgraphs, i.e. the subgraphs constituted by the nodes belonging to a community and all existing edges between them. The following table lists a number of these measures for each community (TODO: short explanation of each):

Next we can pick out interesting community measures by identifying, for example, pairs with greatest spread (i.e. those best separating the communities, as measured by standard deviation), but little correlation (not measuring “the same thing”, small absolute correlation).
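One way to make this selection concrete is sketched below. The scoring rule is an assumption made for illustration and not necessarily the one used in the report: each measure is rescaled to [0, 1] so that standard deviations are comparable, and a pair is scored by its smaller spread multiplied by one minus the absolute Pearson correlation. The measure names `m1` to `m3` and all function names are placeholders.

```python
from itertools import combinations
from statistics import mean, pstdev

def rescale(values):
    """Map values linearly onto [0, 1] so spreads are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def rank_measure_pairs(measures):
    """measures: {name: [one value per community]}. Best pairs first."""
    scaled = {name: rescale(vals) for name, vals in measures.items()
              if max(vals) > min(vals)}  # constant measures separate nothing
    scored = []
    for a, b in combinations(scaled, 2):
        spread = min(pstdev(scaled[a]), pstdev(scaled[b]))
        score = spread * (1 - abs(pearson(scaled[a], scaled[b])))
        scored.append((score, a, b))
    return sorted(scored, reverse=True)

# m2 is a multiple of m1 (fully correlated); m3 is uncorrelated with both.
measures = {"m1": [1, 2, 3, 4], "m2": [2, 4, 6, 8], "m3": [1, 0, 0, 1]}
ranked = rank_measure_pairs(measures)
```

With this toy input the pair (m1, m2) comes last, since the two measures carry the same information, while pairs involving m3 rank highest.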

The following two plots position the parties along pairs of measures (from the tables above) that show low correlation but large spread. Note, however, that some measures may be related to the size of the community and, since sizes differ substantially, might not be easily comparable:

3.4 Graph

We can now also plot a more useful (filtered) version of the complete graph, taking into account community membership. Here, for example, is the graph after first removing edges with weights smaller than 6 and then removing nodes with degree smaller than 3. Nodes are colored by political party, and the layout used to place the nodes explicitly separates the communities.
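The two-step filter can be sketched as follows. It assumes that degree counts incoming plus outgoing edges after the weight filter and that the degree filter is applied once rather than iterated; the function name `filter_graph` is hypothetical.

```python
from collections import Counter

def filter_graph(edges, min_weight=6, min_degree=3):
    """edges: {(source, target): weight}. Returns the filtered edge map."""
    # Step 1: drop edges lighter than min_weight.
    heavy = {e: w for e, w in edges.items() if w >= min_weight}
    # Step 2: drop nodes (and their edges) with too few remaining edges.
    degree = Counter()
    for u, v in heavy:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): w for (u, v), w in heavy.items()
            if degree[u] >= min_degree and degree[v] >= min_degree}

# Toy graph, filtered with smaller thresholds than in the text.
toy = {("a", "b"): 2, ("b", "c"): 3, ("c", "a"): 2,
       ("c", "d"): 5, ("a", "e"): 1}
filtered = filter_graph(toy, min_weight=2, min_degree=2)
```

In the toy graph the light edge to "e" is removed in the first step, and "d" is removed in the second because it keeps only one edge.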

4 Hashtag level

As a first step towards understanding the dynamics and interaction, rather than merely the structure of communities, we can analyze the frequency with which each community uses a given hashtag, dynamically, over time. I.e. for a given hashtag we can plot the time series given by the number of retweets for each community that use the given hashtag on a per-day basis.
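Such a per-community time series could be assembled as in the following sketch. The record layout `(day, user, hashtags)`, the function name `hashtag_series`, and the example hashtag are assumptions made for illustration.

```python
from collections import Counter

def hashtag_series(retweets, membership, hashtag):
    """Daily retweet counts per community for one hashtag.

    retweets: iterable of (day, user, hashtags) records.
    membership: {account: community}.
    Returns a Counter keyed by (community, day).
    """
    tag = hashtag.lower()
    series = Counter()
    for day, user, hashtags in retweets:
        community = membership.get(user)
        # Skip users without a community; match hashtags case-insensitively.
        if community is not None and tag in (h.lower() for h in hashtags):
            series[(community, day)] += 1
    return series

membership = {"a": "podemos", "b": "podemos", "c": "pp"}
records = [
    ("2015-09-27", "a", ["27S"]),
    ("2015-09-27", "b", ["27s", "other"]),
    ("2015-09-28", "c", ["27S"]),
    ("2015-09-28", "z", ["27S"]),  # "z" has no community and is ignored
]
series = hashtag_series(records, membership, "27S")
```

Replacing the day key with an hourly key would give the higher-resolution series discussed in this section.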

The ultimate goal here is to compare the hashtag-frequency time series for different communities to determine their mutual influence. The question currently is whether at a sufficient resolution (e.g. hourly) there are enough tweets for a hashtag to perform information-theoretic analysis of the time series.

5 Media consumption

We identify media outlets of interest by manually inspecting the top 100 accounts in each community and collecting those that belong to TV, press or other media companies. Note (TODO): since we have previously filtered out all accounts not belonging to one of the big communities, some media outlets, namely those not obviously belonging to a community, won’t appear here. This should probably be changed.

The first plot shows the important media outlets, as well as the communities that they’re mostly affiliated with:

However, this doesn’t give us the whole picture. Although a media outlet may be mostly associated with one particular community (that retweets its news the most, for example), other communities may also interact with it.